Scientific Data Integration: Wrapping Textual Documents with a Database View Mechanism and an XML Engine
Abstract
Nowadays scientific data is inevitably digital and stored in a wide variety of formats in heterogeneous systems. Scientists need to access an integrated view of remote or local heterogeneous data sources with advanced data analysis and visualization tools. Building a digital library for scientific data requires accessing and manipulating data extracted from flat files or documents retrieved from the Web. We present an approach to querying flat files as well as Web data sources through an object database view based on a database system and a wrapper. Generally a wrapper has two tasks: first, it sends a query to the source to retrieve data; second, it builds the expected output with respect to the virtual structure. Scientific data servers, in particular those publicly available on the Web, usually provide information retrieval techniques to access data. Our wrappers perform these two tasks with, respectively, a retrieval component and an XML engine. The retrieval component is based on an intermediate object view mechanism called search views, which maps the source capabilities (including full-text retrieval accesses) to attributes. Although the retrieval component is specific to each data source, this approach shows that the extraction component, the XML engine, can be common. We describe our system and focus on the retrieval component of the Object-Web Wrapper (OWW) for Web sources. The originality of our approach consists of (1) a common wrapper architecture for flat files and Web data sources sharing an XML engine for data extraction, (2) a generic view mechanism to access data sources with limited capabilities, and (3) the representation of hyperlinks as abstract attributes in the object view as well as their use in the search view. Our approach has been developed and demonstrated as part of the multidatabase system supporting queries via uniform Object Protocol Model (OPM) interfaces.
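The two-task wrapper split described in the abstract can be illustrated with a minimal Python sketch. This is not the OWW implementation: the function names (`retrieve_flat`, `retrieve_web`, `extract`, `wrap`), the pipe-delimited flat-file layout, the `<hits>`/`<hit>` XML shape, and the sample data are all hypothetical, chosen only to show a source-specific retrieval step feeding one shared XML-based extraction step.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

# Hypothetical in-memory stand-ins for a flat file and a Web source.
FLAT_FILE = ["P1|kinase|human", "P2|protease|mouse"]
WEB_PAGES = {"kinase": "<hits><hit id='W7' organism='yeast'/></hits>"}


def retrieve_flat(keyword: str) -> str:
    """Source-specific retrieval: scan flat-file lines and emit matches as XML."""
    root = ET.Element("hits")
    for line in FLAT_FILE:
        ident, name, organism = line.split("|")
        if keyword in name:
            ET.SubElement(root, "hit", id=ident, organism=organism)
    return ET.tostring(root, encoding="unicode")


def retrieve_web(keyword: str) -> str:
    """Source-specific retrieval: return a canned response for the keyword."""
    return WEB_PAGES.get(keyword, "<hits/>")


def extract(xml_text: str) -> List[Dict[str, str]]:
    """Shared extraction: map each <hit> element to a record of attributes."""
    return [dict(hit.attrib) for hit in ET.fromstring(xml_text).iter("hit")]


def wrap(retrieve: Callable[[str], str], keyword: str) -> List[Dict[str, str]]:
    """A wrapper is a retrieval component composed with the common extractor."""
    return extract(retrieve(keyword))
```

In this sketch only the `retrieve_*` functions differ per source, while `extract` is reused unchanged, which mirrors the abstract's claim that the extraction component can be common; calling `wrap(retrieve_flat, "kinase")` and `wrap(retrieve_web, "kinase")` goes through the same extraction code.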
Similar resources
ARAXA: an object-relational approach to store active XML documents
Active XML (AXML) documents combine extensional XML data with intentional data defined through Web service calls. The dynamic properties of these documents pose challenges to both storage and data materialization techniques. We present ARAXA, a non-intrusive approach to store AXML documents. It takes advantage of complex objects from object-relational DBMS to represent both extensional and inte...
Evaluating Performance and Quality of XML-Based Similarity Joins
A similarity join correlating fragments in XML documents, which are similar in structure and content, can be used as the core algorithm to support data cleaning and data integration tasks. For this reason, built-in support for such an operator in an XML database management system (XDBMS) is very attractive. However, similarity assessment is especially difficult on XML datasets, because structur...
Integration of IR into an XML Database
Structure matching has been the focus and strength of standard XML querying. However, textual content is still an essential component of XML data. It is therefore important to extend the standard XML database engine to allow for “Information Retrieval” style queries, namely, “keyword” based retrieval and “result ranking”. In this paper, we describe our effort in integrating information retrieva...
XML Tag Information Management System: A Workbench for Ontology-based Knowledge Acquisition and Integration
In this paper, we propose an integrated information management system in which ontology-based knowledge integration and XML-based text/data retrieval are combined using tag information and ontology management tools. The main purpose of the system is to implement a query answering system for XML-based documents in the domain of molecular biology. The aim is to provide efficient access to heterog...
Deferred Incremental Refresh of XML Materialized Views: Algorithms and Performance Evaluation
The view mechanism can provide the user with an appropriate portion of database through data filtering and integration. Views are often materialized for query performance improvement, and in that case, their consistency needs to be maintained against the updates of the underlying data. They can be either recomputed or incrementally refreshed by reflecting only the relevant updates. With the eme...